Abstract

We consider the problem of classification when inputs correspond to sets of vectors. This setting 
occurs in many problems such as the classification of pieces of mail containing several pages, of web 
sites with several sections or of images that have been pre-segmented into smaller regions. We propose 
generalizations of the restricted Boltzmann machine (RBM) that are appropriate in this context and 
explore how to incorporate different assumptions about the relationship between the input sets and 
the target class within the RBM. In experiments on standard multiple-instance learning datasets, we 
demonstrate the competitiveness of approaches based on RBMs and apply the proposed variants to the 
problem of incoming mail classification. 



1 Introduction 



The vast majority of machine learning algorithms are developed in the context where each input can be
assumed to take the form of a fixed-size vector x. In some applications however, such an assumption does
not hold and inputs cannot easily be processed into this form. In this paper, we consider one such setting
where inputs consist of an unordered and variable-length set of vectors $X = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(|X|)}\}$. For instance,
X could be the set of text documents $\mathbf{x}^{(s)}$ found in some incoming piece of mail, where each document is
represented as a bag of words. In this particular example, a simple approach to converting the set X into
a single vector x would consist of computing the global bag-of-words representation of all documents in X,
as if all documents had been concatenated into a single one. This would however correspond to throwing
away all the information about the structure of the incoming mail, which could be useful to solve the task
at hand. This problem setting is not specific to text data either: X could correspond to a collection of
images or to a single image that has been pre-segmented, and some recognition tasks in computer vision
have previously been formulated in terms of classification of sets (Kondor and Jebara, 2003; Wallraven
et al., 2003). Another example is text-independent speaker recognition (Reynolds, 1995), where inputs
are sequences of acoustic vectors but for which the order is not relevant: relevant short-term dynamics
are taken into account in the vector features themselves (e.g. spectral coefficients and their derivatives)
and long-term dynamics are not useful for classification (the succession of these features is informative of
the speech content, not the speaker identity).

A popular approach to classifying sets has been that of multiple-instance learning (MIL). In MIL,
binary classification of sets of vectors is performed by assuming that a set belongs to the positive class if
at least one element of the set belongs to that positive class. Otherwise the set belongs to the negative
class (i.e. all elements of the set are from the negative class). This problem was originally motivated in
the context of drug activity prediction (Dietterich et al., 1997), where a drug molecule can take several
shapes but only some of them might allow the molecule to bind with some given protein associated with a
disease. A drug molecule can then be represented as the set of its potential shapes and this set will have
a positive label only if at least one of its shapes allows binding.

The MIL approach makes the implicit assumption that the presence of just a single positive example 
is sufficient to recognize the whole set as positive. However, this assumption is not always appropriate. 
For instance, each vector could only provide partial class information, such that the observation of only a 
single informative vector is not enough to label the whole set as positive. 

In this paper, we describe extensions of the restricted Boltzmann machine that perform multiclass
classification of sets and do not assume that sufficient discriminative information is present in a single
element of the set. By learning a latent representation of its input, these extensions can deal with cases 
where only partial evidence of class membership is present in only a few set vectors. We report competitive 
results on some common MIL datasets and present an application of these models to a mail classification 
problem. 



2 Classification with Restricted Boltzmann Machines 



In this work, we build on a specific restricted Boltzmann machine (RBM) that can be used to perform
classification (Larochelle and Bengio, 2008; Tieleman, 2008). We will refer to this RBM as a classification
RBM (ClassRBM).

The ClassRBM is an energy-based probabilistic model where a layer h of H binary hidden units is
used to model the joint distribution of a vector of D inputs x and a target vector y of size C. The target
y corresponds to a class label and takes the "one out of C" representation, meaning that if x belongs to
class c, then $\mathbf{y} = \mathbf{e}_c$, where $\mathbf{e}_c$ is a vector with all values set to 0 except at position c, which is set to 1. For
simplicity, we will also assume that x is a binary vector, though generalizations to other types of vectors
are possible (Welling et al., 2005).



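Using the energy function

$$E(\mathbf{x}, \mathbf{y}, \mathbf{h}) = -\mathbf{d}^\top \mathbf{y} - \mathbf{c}^\top \mathbf{h} - \mathbf{b}^\top \mathbf{x} - \mathbf{h}^\top \mathbf{W} \mathbf{x} - \mathbf{h}^\top \mathbf{U} \mathbf{y},$$

the probability for some configuration of x, y and h is defined as

$$p(\mathbf{x}, \mathbf{y}, \mathbf{h}) = \exp(-E(\mathbf{x}, \mathbf{y}, \mathbf{h})) / Z$$

where Z is a normalizing constant that ensures p(x, y, h) defines a valid distribution. Figure 1 shows an
illustration of a ClassRBM.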

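Though Z (and hence p(x, y, h)) is usually intractable to compute, the following conditional distributions
of the model are themselves tractable:

$$\begin{aligned}
p(\mathbf{h} | \mathbf{x}, \mathbf{y}) &= \prod_{j=1}^{H} p(h_j | \mathbf{x}, \mathbf{y}) \\
p(h_j = 1 | \mathbf{x}, \mathbf{y}) &= \mathrm{sigm}(c_j + \mathbf{W}_{j\cdot} \mathbf{x} + \mathbf{U}_{j\cdot} \mathbf{y}) \\
p(\mathbf{x}, \mathbf{y} | \mathbf{h}) &= p(\mathbf{y} | \mathbf{h}) \prod_{i=1}^{D} p(x_i | \mathbf{h}) \\
p(x_i = 1 | \mathbf{h}) &= \mathrm{sigm}(b_i + \mathbf{h}^\top \mathbf{W}_{\cdot i}) \\
p(\mathbf{y} = \mathbf{e}_c | \mathbf{h}) &= \frac{\exp(\mathbf{d}^\top \mathbf{e}_c + \mathbf{h}^\top \mathbf{U} \mathbf{e}_c)}{\sum_{c'=1,\dots,C} \exp(\mathbf{d}^\top \mathbf{e}_{c'} + \mathbf{h}^\top \mathbf{U} \mathbf{e}_{c'})}
\end{aligned}$$

where $\mathrm{sigm}(a) = 1/(1 + \exp(-a))$ and we use the notation $\mathbf{A}_{j\cdot}$ to refer to the j-th row of matrix A and $\mathbf{A}_{\cdot i}$
for its i-th column.

Figure 1: Illustration of a standard ClassRBM. An example of activations for the hidden layer units is
given (black means hidden unit is equal to 1).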

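It can also be shown that the posterior class probability distribution given some input x
has the closed form

$$p(\mathbf{y} = \mathbf{e}_c | \mathbf{x}) = \sum_{\mathbf{h}} p(\mathbf{y} = \mathbf{e}_c, \mathbf{h} | \mathbf{x}) = \frac{\exp(-F(\mathbf{x}, \mathbf{y}))}{\sum_{c'=1,\dots,C} \exp(-F(\mathbf{x}, \mathbf{e}_{c'}))}$$

where F(x, y) is referred to as the free-energy

$$F(\mathbf{x}, \mathbf{y}) = -\mathbf{d}^\top \mathbf{y} - \sum_{j=1}^{H} \mathrm{softplus}(c_j + \mathbf{W}_{j\cdot} \mathbf{x} + \mathbf{U}_{j\cdot} \mathbf{y})$$

with $\mathrm{softplus}(a) = \log(1 + \exp(a))$.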
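As an illustration of the closed form above, the following is a minimal NumPy sketch of the class posterior computed from the free-energy. The function name `class_posterior`, the array shapes and the random parameters are assumptions made for this example only; they are not taken from the original implementation.

```python
import numpy as np

def softplus(a):
    # Numerically stable log(1 + exp(a)).
    return np.logaddexp(0.0, a)

def class_posterior(x, W, U, c, d):
    """Return p(y = e_k | x) for every class k of a ClassRBM (illustrative sketch).

    Assumed shapes: x (D,), W (H, D), U (H, C), c (H,), d (C,).
    """
    # Column k of `act` holds c_j + W_j. x + U_j. e_k for all hidden units j.
    act = (c + W @ x)[:, None] + U
    neg_free_energy = d + softplus(act).sum(axis=0)   # -F(x, e_k), shape (C,)
    neg_free_energy = neg_free_energy - neg_free_energy.max()  # stabilise the softmax
    p = np.exp(neg_free_energy)
    return p / p.sum()

# Hypothetical toy parameters, only to exercise the function.
rng = np.random.default_rng(0)
D, H, C = 6, 4, 3
W, U = rng.normal(size=(H, D)), rng.normal(size=(H, C))
c, d = rng.normal(size=H), rng.normal(size=C)
x = rng.integers(0, 2, size=D).astype(float)
p = class_posterior(x, W, U, c, d)
```

The bias term $\mathbf{b}^\top \mathbf{x}$ does not appear because it is the same for every class and cancels in the normalization.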

In order to train the ClassRBM, different strategies can be followed. A first option is to train it
discriminatively, by minimizing the average negative conditional log-likelihood $-\log p(\mathbf{y}_t | \mathbf{x}_t)$ of the parameters
for the available training data $\{\mathbf{x}_t, \mathbf{y}_t\}$. This can be achieved by simple stochastic gradient descent.

A second option is to train the ClassRBM generatively, by minimizing the negative joint log-likelihood
$-\log p(\mathbf{x}_t, \mathbf{y}_t)$. Unfortunately, the necessary gradients cannot be computed exactly. The Contrastive
Divergence (CD) algorithm (Hinton et al., 2006) however provides a useful approximation



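$$\frac{\partial\, {-\log p(\mathbf{x}_t, \mathbf{y}_t)}}{\partial \theta} \approx \mathbb{E}_{\mathbf{h} | \mathbf{x}_t, \mathbf{y}_t}\left[\frac{\partial E(\mathbf{x}_t, \mathbf{y}_t, \mathbf{h})}{\partial \theta}\right] - \mathbb{E}_{\mathbf{h} | \tilde{\mathbf{x}}_t, \tilde{\mathbf{y}}_t}\left[\frac{\partial E(\tilde{\mathbf{x}}_t, \tilde{\mathbf{y}}_t, \mathbf{h})}{\partial \theta}\right]$$

where $\theta$ is any parameter of the ClassRBM and where $\tilde{\mathbf{x}}_t$ and $\tilde{\mathbf{y}}_t$ are the result of a one-step Gibbs sampling
chain, initialized at the training example $\mathbf{x}_t$ and $\mathbf{y}_t$. Noting $\hat{\mathbf{h}}_t = \mathrm{sigm}(\mathbf{c} + \mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{y}_t)$ and
$\tilde{\mathbf{h}}_t = \mathrm{sigm}(\mathbf{c} + \mathbf{W}\tilde{\mathbf{x}}_t + \mathbf{U}\tilde{\mathbf{y}}_t)$, we get the following stochastic gradient update from CD:

$$\begin{aligned}
\mathbf{b} &\leftarrow \mathbf{b} + \lambda\, (\mathbf{x}_t - \tilde{\mathbf{x}}_t) \\
\mathbf{c} &\leftarrow \mathbf{c} + \lambda\, (\hat{\mathbf{h}}_t - \tilde{\mathbf{h}}_t) \\
\mathbf{d} &\leftarrow \mathbf{d} + \lambda\, (\mathbf{y}_t - \tilde{\mathbf{y}}_t) \\
\mathbf{W} &\leftarrow \mathbf{W} + \lambda\, (\hat{\mathbf{h}}_t \mathbf{x}_t^\top - \tilde{\mathbf{h}}_t \tilde{\mathbf{x}}_t^\top) \\
\mathbf{U} &\leftarrow \mathbf{U} + \lambda\, (\hat{\mathbf{h}}_t \mathbf{y}_t^\top - \tilde{\mathbf{h}}_t \tilde{\mathbf{y}}_t^\top)
\end{aligned}$$

where $\lambda$ is the stochastic gradient learning rate used for generative training.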
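The CD update for the ClassRBM can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions: the function name `cd1_update`, the in-place update of a parameter tuple and the toy initialization are choices made for this example, and the Gibbs step samples a binary hidden layer from the positive-phase probabilities before reconstructing x and y.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, y, params, lam, rng):
    """One CD-1 update, modifying (W, U, b, c, d) in place (illustrative sketch).

    x is a binary vector of shape (D,), y a one-hot vector of shape (C,).
    """
    W, U, b, c, d = params
    # Positive phase: hidden probabilities given the training pair.
    h_pos = sigm(c + W @ x + U @ y)
    h_sample = (rng.random(h_pos.shape) < h_pos).astype(float)
    # One Gibbs step: sample reconstructions of x and y given the sampled hidden layer.
    x_neg = (rng.random(x.shape) < sigm(b + h_sample @ W)).astype(float)
    logits = d + h_sample @ U
    p_y = np.exp(logits - logits.max())
    p_y /= p_y.sum()
    y_neg = np.eye(len(d))[rng.choice(len(d), p=p_y)]
    # Negative phase: hidden probabilities given the reconstruction.
    h_neg = sigm(c + W @ x_neg + U @ y_neg)
    # Parameter updates with learning rate lam.
    b += lam * (x - x_neg)
    c += lam * (h_pos - h_neg)
    d += lam * (y - y_neg)
    W += lam * (np.outer(h_pos, x) - np.outer(h_neg, x_neg))
    U += lam * (np.outer(h_pos, y) - np.outer(h_neg, y_neg))
    return x_neg, y_neg

# Hypothetical toy setup.
rng = np.random.default_rng(0)
D, H, C = 5, 3, 2
params = (0.1 * rng.normal(size=(H, D)), 0.1 * rng.normal(size=(H, C)),
          np.zeros(D), np.zeros(H), np.zeros(C))
x = rng.integers(0, 2, size=D).astype(float)
y = np.eye(C)[1]
x_neg, y_neg = cd1_update(x, y, params, lam=0.05, rng=rng)
```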



As argued by Larochelle and Bengio (2008), in some situations, neither discriminative nor generative
learning alone is optimal and better performance can be achieved by using a linear combination of both
objectives. This is referred to as hybrid generative/discriminative learning and corresponds to performing
both the discriminative and generative parameter updates with separate learning rates.



3 Generalization of the ClassRBM to handle sets (ClassSetRBM) 

Now, we wish to generalize the ClassRBM so that it can model the distribution of a set $X = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(|X|)}\}$
with target vector y. The simplest approach would be to connect each vector $\mathbf{x}^{(s)}$ to some global hidden
layer h with the same connection matrix W. However, this approach is not appropriate because, not only
do we expect sets to have varying sizes, but the number of vectors $\mathbf{x}^{(s)}$ in X that actually contain
predictive information about y will also vary. By having just a single global hidden layer, the activity of
hidden units would thus tend to over-saturate for sets of large size.

To address this issue, we propose two generalizations where each vector $\mathbf{x}^{(s)}$ of a set will be connected
to its own "copy" of the hidden layer. The number of hidden layers will then depend on the size of the
input¹. Each hidden layer will be connected to its corresponding input vector by the same matrix W.

Given this approach, there are still different design choices to be made as to how these hidden layers 
should interact with the target units y. We present here two such choices, which correspond to different 
assumptions about the nature of the interaction between the input sets and the target. 

3.1 ClassSetRBM with Mutually Exclusive Hidden Units (XOR) 

If we believe that the vectors in the input set X all contain information of a distinct nature, then a hidden
feature detected within one vector $\mathbf{x}^{(s)}$ would be expected to be absent in the other vectors of the set. In
this case, the set structure would convey very useful information about how to perform classification.

¹ The implication of this is that we always condition on the size of the input set.

Figure 2: Illustration of ClassSetRBM^XOR. Dotted lines connect hidden layers whose activity is subject
to joint constraints. An example of valid activations for the hidden layer units is given.



To exploit such information, we could impose that the activity of hidden units be mutually exclusive
across the vectors of the set. Noting $\mathbf{h}^{(s)}$ the hidden layer to which $\mathbf{x}^{(s)}$ is connected and
$H = \{\mathbf{h}^{(1)}, \dots, \mathbf{h}^{(|X|)}\}$ the set of hidden layer vectors, this would translate into requiring the constraint
that

$$\sum_{s=1}^{|X|} h_j^{(s)} \in \{0, 1\} \quad \forall j = 1, \dots, H$$

i.e. for every hidden unit position j, at most one hidden unit $h_j^{(s)}$ should be active across all vectors $\mathbf{x}^{(s)}$.
With that, we define the energy function

$$E(X, \mathbf{y}, H) = -\mathbf{d}^\top \mathbf{y} - \sum_{s=1}^{|X|} \mathbf{b}^\top \mathbf{x}^{(s)} - \sum_{s=1}^{|X|} \mathbf{c}^\top \mathbf{h}^{(s)} - \sum_{s=1}^{|X|} \left( \mathbf{h}^{(s)\top} \mathbf{W} \mathbf{x}^{(s)} + \mathbf{h}^{(s)\top} \mathbf{U} \mathbf{y} \right)$$

where the target y is connected to all hidden layers through the same connection matrix U. We will
refer to this variant of the ClassRBM for sets as ClassSetRBM^XOR. See Figure 2 for an illustration of
ClassSetRBM^XOR.

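While more complicated than the ClassRBM for single vectors, it can be shown that ClassSetRBM^XOR
has simple conditional distributions as well. The hidden layers' conditional distribution becomes

$$\begin{aligned}
p(H | X, \mathbf{y}) &= \prod_{j=1}^{H} p(\{h_j^{(s)}\}_{s=1}^{|X|} | X, \mathbf{y}) \\
p(h_j^{(s)} = 1 | X, \mathbf{y}) &= \frac{\exp(\mathrm{act}_j(\mathbf{x}^{(s)}, \mathbf{y}))}{1 + \sum_{s'=1}^{|X|} \exp(\mathrm{act}_j(\mathbf{x}^{(s')}, \mathbf{y}))} \\
p(h_j = 0 | X, \mathbf{y}) &= \frac{1}{1 + \sum_{s'=1}^{|X|} \exp(\mathrm{act}_j(\mathbf{x}^{(s')}, \mathbf{y}))}
\end{aligned}$$

where $\mathrm{act}_j(\mathbf{x}^{(s)}, \mathbf{y}) = c_j + \mathbf{W}_{j\cdot} \mathbf{x}^{(s)} + \mathbf{U}_{j\cdot} \mathbf{y}$ and the statement $h_j = 0$ is a shorthand for
$h_j^{(s)} = 0 \;\; \forall s = 1, \dots, |X|$. The input and target vectors' distributions are

$$\begin{aligned}
p(X, \mathbf{y} | H) &= p(\mathbf{y} | H) \prod_{s=1}^{|X|} \prod_{i=1}^{D} p(x_i^{(s)} | \mathbf{h}^{(s)}) \\
p(x_i^{(s)} = 1 | \mathbf{h}^{(s)}) &= \mathrm{sigm}(b_i + \mathbf{h}^{(s)\top} \mathbf{W}_{\cdot i}) \\
p(\mathbf{y} = \mathbf{e}_c | H) &= \frac{\exp(\mathbf{d}^\top \mathbf{e}_c + \sum_{s=1}^{|X|} \mathbf{h}^{(s)\top} \mathbf{U} \mathbf{e}_c)}{\sum_{c'=1}^{C} \exp(\mathbf{d}^\top \mathbf{e}_{c'} + \sum_{s=1}^{|X|} \mathbf{h}^{(s)\top} \mathbf{U} \mathbf{e}_{c'})}.
\end{aligned}$$

These conditional distributions are simple enough that Gibbs sampling can be performed, by sampling
each element of H given X and y, and then sampling new values for the vectors in X and for y.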
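To make the sampling of H under the mutual exclusivity constraint concrete, the sketch below treats each hidden unit position j as a categorical variable with |X| + 1 outcomes: either no copy of unit j is active, or exactly the copy attached to one vector $\mathbf{x}^{(s)}$ is active. The function name `sample_hidden_xor` and the array layout (one row per set element) are assumptions made for this example.

```python
import numpy as np

def sample_hidden_xor(X, y, W, U, c, rng):
    """Sample H with at most one active copy of each hidden unit (illustrative sketch).

    Assumed shapes: X (S, D) set of S binary vectors, y one-hot (C,),
    W (H, D), U (H, C), c (H,). Returns the sample and p(h_j^(s) = 1 | X, y).
    """
    S = X.shape[0]
    act = c + X @ W.T + U @ y          # act[s, j] = c_j + W_j. x^(s) + U_j. y
    # Prepend a zero logit for the outcome "all copies of unit j are off".
    logits = np.vstack([np.zeros(act.shape[1]), act])   # shape (S + 1, H)
    probs = np.exp(logits - logits.max(axis=0))
    probs /= probs.sum(axis=0)
    Hs = np.zeros_like(act)
    for j in range(act.shape[1]):
        k = rng.choice(S + 1, p=probs[:, j])
        if k > 0:                      # outcome k > 0 activates h_j^(k)
            Hs[k - 1, j] = 1.0
    return Hs, probs[1:]

# Hypothetical toy setup.
rng = np.random.default_rng(0)
S, D, H, C = 3, 5, 4, 2
X = rng.integers(0, 2, size=(S, D)).astype(float)
y = np.eye(C)[0]
W, U, c = rng.normal(size=(H, D)), rng.normal(size=(H, C)), rng.normal(size=H)
Hs, p_on = sample_hidden_xor(X, y, W, U, c, rng)
```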

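The target posterior p(y|X) can also be computed efficiently. It can be shown that it has the following
form

$$p(\mathbf{y} = \mathbf{e}_c | X) = \frac{\exp(-F^{\mathrm{XOR}}(X, \mathbf{y}))}{\sum_{c'=1,\dots,C} \exp(-F^{\mathrm{XOR}}(X, \mathbf{e}_{c'}))}$$

where the free-energy is now

$$F^{\mathrm{XOR}}(X, \mathbf{y}) = -\mathbf{d}^\top \mathbf{y} - \sum_{j=1}^{H} \mathrm{softplus}(\mathrm{softmax}_j(X) + \mathbf{U}_{j\cdot} \mathbf{y})$$

with $\mathrm{softmax}_j(X) = \log\left(\sum_{s=1}^{|X|} \exp(c_j + \mathbf{W}_{j\cdot} \mathbf{x}^{(s)})\right)$, which can be seen as a soft version of the max function of
$c_j + \mathbf{W}_{j\cdot} \mathbf{x}^{(s)}$ over the set of input vectors.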
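A minimal NumPy sketch of this posterior follows. It assumes the set is stored as an array with one row per vector; the name `xor_posterior` and the toy parameters are hypothetical. The log-sum-exp pooling over the set is computed with `np.logaddexp.reduce` for numerical stability.

```python
import numpy as np

def xor_posterior(X, W, U, c, d):
    """p(y = e_k | X) for ClassSetRBM^XOR (illustrative sketch).

    Assumed shapes: X (S, D), W (H, D), U (H, C), c (H,), d (C,).
    """
    pre = c + X @ W.T                             # pre[s, j] = c_j + W_j. x^(s)
    pooled = np.logaddexp.reduce(pre, axis=0)     # softmax_j(X), shape (H,)
    # -F(X, e_k) = d_k + sum_j softplus(softmax_j(X) + U_jk)
    neg_F = d + np.logaddexp(0.0, pooled[:, None] + U).sum(axis=0)
    p = np.exp(neg_F - neg_F.max())
    return p / p.sum()

# Hypothetical toy setup.
rng = np.random.default_rng(0)
S, D, H, C = 4, 6, 5, 3
X = rng.integers(0, 2, size=(S, D)).astype(float)
W, U = rng.normal(size=(H, D)), rng.normal(size=(H, C))
c, d = rng.normal(size=H), rng.normal(size=C)
p = xor_posterior(X, W, U, c, d)
p_shuffled = xor_posterior(X[::-1], W, U, c, d)   # same set, different order
```

Because the pooling is a sum over the set, the posterior does not depend on the order in which the vectors are listed, and for a set of size one it reduces to the ClassRBM posterior.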

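As before, given a training set of pairs $\{X_t, \mathbf{y}_t\}$, it is possible to train ClassSetRBM^XOR discriminatively
and generatively. The discriminative gradients are easily computed using the chain rule. CD
approximations for the generative learning updates can also be obtained, since Gibbs sampling can be
performed.

We denote by $\tilde{X}_t = \{\tilde{\mathbf{x}}_t^{(1)}, \dots, \tilde{\mathbf{x}}_t^{(|X_t|)}\}$ and $\tilde{\mathbf{y}}_t$ the result of one step of Gibbs sampling initialized at the
training example pair $(X_t, \mathbf{y}_t)$. Similarly to the standard ClassRBM case, we also denote by $\hat{\mathbf{h}}_t^{(s)}$ and $\tilde{\mathbf{h}}_t^{(s)}$
the vectors containing the conditional probabilities of the hidden units being equal to 1, conditioned on
$(X_t, \mathbf{y}_t)$ and $(\tilde{X}_t, \tilde{\mathbf{y}}_t)$ respectively. Then, the Contrastive Divergence (CD) learning updates are computed
as follows:

$$\begin{aligned}
\mathbf{b} &\leftarrow \mathbf{b} + \lambda \sum_s (\mathbf{x}_t^{(s)} - \tilde{\mathbf{x}}_t^{(s)}) \\
\mathbf{c} &\leftarrow \mathbf{c} + \lambda \sum_s (\hat{\mathbf{h}}_t^{(s)} - \tilde{\mathbf{h}}_t^{(s)}) \\
\mathbf{d} &\leftarrow \mathbf{d} + \lambda\, (\mathbf{y}_t - \tilde{\mathbf{y}}_t) \\
\mathbf{W} &\leftarrow \mathbf{W} + \lambda \sum_s (\hat{\mathbf{h}}_t^{(s)} \mathbf{x}_t^{(s)\top} - \tilde{\mathbf{h}}_t^{(s)} \tilde{\mathbf{x}}_t^{(s)\top}) \\
\mathbf{U} &\leftarrow \mathbf{U} + \lambda \sum_s (\hat{\mathbf{h}}_t^{(s)} \mathbf{y}_t^\top - \tilde{\mathbf{h}}_t^{(s)} \tilde{\mathbf{y}}_t^\top).
\end{aligned}$$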

3.2 ClassSetRBM for Sets with Redundant Evidence (OR) 

There might be cases where the assumption of mutual exclusivity of the hidden units is too strong. One 
simple such case would be if additional copies of vectors were inserted in the set. More generally, it could 
be that the same useful hidden feature is present in several vectors within the input set. In this case, the 
actual number of vectors containing this evidence is not relevant, only the presence of that evidence in at 
least one set element is. It might then be desirable to have a model that is more robust to such situations. 
Figure 3: Illustration of ClassSetRBM^OR. Compared to Figure 2, this RBM has a set of pairs of hidden
layers (one for the inputs, one for the target), with additional constraints within pairs. The mutual
exclusivity constraint only applies to the target hidden layers. An example of valid activations for the
hidden layer units is also given.



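To achieve this, we must somehow remove mutual exclusivity over the vectors $\mathbf{x}^{(s)}$ but maintain it for
the connections with the target y. This can be accomplished by having additional hidden layer "copies"
$G = \{\mathbf{g}^{(1)}, \dots, \mathbf{g}^{(|X|)}\}$ connected only to y and removing the direct connections of H to y, yielding the
new energy function

$$E(X, \mathbf{y}, H, G) = -\mathbf{d}^\top \mathbf{y} - \sum_{s=1}^{|X|} \mathbf{b}^\top \mathbf{x}^{(s)} - \sum_{s=1}^{|X|} \mathbf{c}^\top \mathbf{h}^{(s)} - \sum_{s=1}^{|X|} \left( \mathbf{h}^{(s)\top} \mathbf{W} \mathbf{x}^{(s)} + \mathbf{g}^{(s)\top} \mathbf{U} \mathbf{y} \right).$$

Then, dependencies between X and y are modeled by the hidden units through the imposition of the
following constraints on the activities in H and G:

$$\begin{aligned}
\sum_{s=1}^{|X|} g_j^{(s)} &\in \{0, 1\} \quad \forall j = 1, \dots, H \\
h_j^{(s)} &\geq g_j^{(s)} \quad \forall j = 1, \dots, H, \;\; \forall s = 1, \dots, |X|.
\end{aligned}$$

Hence, for a target hidden unit $g_j^{(s)}$ to be active, at least one input hidden unit $h_j^{(s)}$ will need to be active.
We call this second model ClassSetRBM^OR. Also see Figure 3 for an illustration of ClassSetRBM^OR.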
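It can be shown that ClassSetRBM^OR has the following conditionals over G and H:

$$\begin{aligned}
p(G | X, \mathbf{y}) &= \prod_{j=1}^{H} p(\{g_j^{(s)}\}_{s=1}^{|X|} | X, \mathbf{y}) \\
p(g_j^{(s)} = 1 | X, \mathbf{y}) &= \frac{\exp(\mathrm{act}_j(\mathbf{x}^{(s)}, \mathbf{y}))}{1 + \sum_{s'=1}^{|X|} \exp(\mathrm{act}_j(\mathbf{x}^{(s')}, \mathbf{y}))} \\
p(g_j = 0 | X, \mathbf{y}) &= \frac{1}{1 + \sum_{s'=1}^{|X|} \exp(\mathrm{act}_j(\mathbf{x}^{(s')}, \mathbf{y}))} \\
p(H | G, X, \mathbf{y}) &= \prod_{s=1}^{|X|} \prod_{j=1}^{H} p(h_j^{(s)} | g_j^{(s)}, \mathbf{x}^{(s)}) \\
p(h_j^{(s)} = 1 | g_j^{(s)}, \mathbf{x}^{(s)}) &= \begin{cases} 1, & \text{if } g_j^{(s)} = 1 \\ \mathrm{sigm}(c_j + \mathbf{W}_{j\cdot} \mathbf{x}^{(s)}), & \text{else} \end{cases}
\end{aligned}$$

where we now have $\mathrm{act}_j(\mathbf{x}^{(s)}, \mathbf{y}) = \mathrm{softminus}(c_j + \mathbf{W}_{j\cdot} \mathbf{x}^{(s)}) + \mathbf{U}_{j\cdot} \mathbf{y}$, with $\mathrm{softminus}(a) = a - \mathrm{softplus}(a)$.
The conditionals for X and y are the same as in Section 3.1, with the exception that y is conditioned on
G, not H.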

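The class posterior is also tractable. It is the same as in Section 3.1, but with a free-energy $F^{\mathrm{OR}}(X, \mathbf{y})$
where the $\mathrm{softmax}_j(X)$ function is now

$$\mathrm{softmax}_j(X) = \log\left( \sum_{s=1}^{|X|} \exp(\mathrm{softminus}(c_j + \mathbf{W}_{j\cdot} \mathbf{x}^{(s)})) \right).$$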
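The pooling used by ClassSetRBM^OR can be sketched as below. The helper names `softminus` and `or_pool` and the toy shapes are assumptions for illustration. Since softminus(a) equals log sigm(a), each set element contributes a value in (0, 1) to the sum, so the pooled value is bounded above by log |X| regardless of how large the individual activations are.

```python
import numpy as np

def softminus(a):
    # a - softplus(a) = log sigm(a) = -log(1 + exp(-a)), computed stably.
    return -np.logaddexp(0.0, -a)

def or_pool(X, W, c):
    """softmax_j(X) for ClassSetRBM^OR (illustrative sketch); X (S, D), W (H, D), c (H,)."""
    pre = c + X @ W.T                       # pre[s, j] = c_j + W_j. x^(s)
    return np.logaddexp.reduce(softminus(pre), axis=0)

# Hypothetical toy setup with deliberately large weights.
rng = np.random.default_rng(0)
S, D, H = 5, 6, 4
X = rng.integers(0, 2, size=(S, D)).astype(float)
W, c = 5.0 * rng.normal(size=(H, D)), rng.normal(size=H)
pooled = or_pool(X, W, c)
```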



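Once again, discriminative and generative learning can be performed. When performing Gibbs sampling
to compute the CD updates, the hidden layers G and H samples are obtained by first sampling from
p(G|X, y) and then sampling from p(H|G, X, y). Again denoting $(\hat{\mathbf{h}}_t^{(s)}, \hat{\mathbf{g}}_t^{(s)})$ and $(\tilde{\mathbf{h}}_t^{(s)}, \tilde{\mathbf{g}}_t^{(s)})$ the vectors of
hidden probabilities from conditioning on $(X_t, \mathbf{y}_t)$ and $(\tilde{X}_t, \tilde{\mathbf{y}}_t)$ respectively, we obtain the following CD
updates:

$$\begin{aligned}
\mathbf{b} &\leftarrow \mathbf{b} + \lambda \sum_s (\mathbf{x}_t^{(s)} - \tilde{\mathbf{x}}_t^{(s)}) \\
\mathbf{c} &\leftarrow \mathbf{c} + \lambda \sum_s (\hat{\mathbf{h}}_t^{(s)} - \tilde{\mathbf{h}}_t^{(s)}) \\
\mathbf{d} &\leftarrow \mathbf{d} + \lambda\, (\mathbf{y}_t - \tilde{\mathbf{y}}_t) \\
\mathbf{W} &\leftarrow \mathbf{W} + \lambda \sum_s (\hat{\mathbf{h}}_t^{(s)} \mathbf{x}_t^{(s)\top} - \tilde{\mathbf{h}}_t^{(s)} \tilde{\mathbf{x}}_t^{(s)\top}) \\
\mathbf{U} &\leftarrow \mathbf{U} + \lambda \sum_s (\hat{\mathbf{g}}_t^{(s)} \mathbf{y}_t^\top - \tilde{\mathbf{g}}_t^{(s)} \tilde{\mathbf{y}}_t^\top).
\end{aligned}$$
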
3.3 Variants with "hard" max pooling

In both proposed models, the class posterior p(y|X) requires that some hidden units be implicitly "pooled"
by taking a soft version of the maximum over the input set. Specifically, this is achieved through the
$\log(\sum_{s=1}^{|X|} \exp(\cdot))$ operation in their definitions of $\mathrm{softmax}_j(X)$.

In practice, softmax pooling has the disadvantage that at the beginning of training, pooling essentially 
corresponds to summing the activations of all hidden units and does not actually select a single hidden 
unit. This potentially makes it harder for the hidden units to specialize. 

Hence, we also consider "hard" max variants of ClassSetRBM^XOR and ClassSetRBM^OR, referred to
as ClassSetRBM^XOR* and ClassSetRBM^OR* respectively, where the $\log(\sum_{s=1}^{|X|} \exp(\cdot))$ operation is replaced
by a $\max_{s=1,\dots,|X|}(\cdot)$ operation in their definition of $\mathrm{softmax}_j(X)$. This modification is only applied for
computing p(y|X) and the discriminative gradients. This can be understood as an estimation of the true
class posterior p(y|X) by approximating parts of the sum over all the hidden units with a maximum, and
optimizing the conditional log-likelihood of that approximation.
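The difference between the two pooling operations can be checked numerically. The sketch below, with hypothetical toy activations, shows that for small pre-activations (as at the beginning of training with small weights) the soft pooling is close to log |X| plus the average pre-activation, i.e. it behaves like a sum over the set rather than a selection, whereas in general it always lies between the hard max and the hard max plus log |X|.

```python
import numpy as np

def soft_pool(a):
    # log(sum_s exp(a_s)) over the set axis (axis 0).
    return np.logaddexp.reduce(a, axis=0)

def hard_pool(a):
    # max_s a_s over the set axis (axis 0).
    return a.max(axis=0)

rng = np.random.default_rng(0)
S, H = 8, 5
small = 0.01 * rng.normal(size=(S, H))   # pre-activations with small weights
large = 10.0 * rng.normal(size=(S, H))   # pre-activations with large weights

# With small activations, soft pooling ~ log(S) + mean over the set.
gap_small = soft_pool(small) - (np.log(S) + small.mean(axis=0))
# In all cases, 0 <= soft pooling - hard pooling <= log(S).
gap_large = soft_pool(large) - hard_pool(large)
```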



4 Related Work 



As previously mentioned, the problem of classifying inputs corresponding to sets is closely related to that
of multiple-instance learning (MIL). The standard case is binary classification, where a set of training
vectors is labeled positive if and only if at least one instance in the set is positive. MIL has been
studied to solve problems such as drug activity detection (Dietterich et al., 1997) and natural scene
categorization. In the last fifteen years, several types of approaches have been proposed to address MIL,
such as Learning Axis-Parallel Concepts (Dietterich et al., 1997), Diverse Density (Maron and Ratan,
1998) and its Expectation-Maximization version (Zhang and Goldman, 2001). Extensions of k-nearest
neighbours (Citation kNN in Wang and Zucker (2000)), Support Vector Machines (MI-SVM in Andrews
et al. (2002)), decision trees (Chevaleyre and Zucker, 2001), perceptrons (Sabato et al., 2010) and neural
networks (Zhou and Zhang, 2002) have also been explored. Classification within these approaches mainly
consists of computing the maximum output over the vectors in the set, and a loss function to optimize
is expressed accordingly. The approach presented in this paper instead consists of performing the (soft)
maximum pooling over the sets in an intermediate latent representation, instead of on the output posterior
probabilities as proposed by Zhou and Zhang (2002) and Sabato et al. (2010).



Concerning classification of sets in general, several kernels between sets of vectors have been proposed to
generalize kernel-based classifiers (SVM) without modifying the standard optimization problem. There are
kernels defined between probabilistic density functions estimated on each set of vectors, such as the Fisher
kernel (Jaakkola and Haussler, 1998), Mutual Information kernels (Seeger, 2002), Probability product
kernels (Jebara and Kondor, 2004; Lyu, 2005) and radial kernels based on a probabilistic distance such as
Kullback-Leibler (Kondor and Jebara, 2003). These approaches have been developed with some particular
families of density functions such as Gaussian distributions and Gaussian mixture models, and are not
convenient when inputs are high-dimensional or sparse. There are also kernels based on combinations of
kernels between inter-set pairs of instances. These include several kinds of linear combinations of kernels
on inter-set pairs of instances (Louradour et al., 2007; Zhou et al., 2009) as well as max kernels (Wallraven
et al., 2003). The mi-Graph kernels of Zhou et al. (2009) actually achieve some of the best results reported
on MIL tasks (Deselaers and Ferrari, 2010).

The main disadvantage of such kernel-based SVM approaches is that they tend not to scale well with 
big datasets: the complexity of optimizing an SVM is quadratic in the number of training samples, and 
also the complexity of computing kernels between sets is quadratic in the number of vectors per set. 

Finally, in the RBM literature, Lee et al. (2009) also explored the use of soft (probabilistic) pooling 
operations in a convolutional RBM. The two models proposed here can be seen as other pooling-based 
RBMs that are appropriate when the inputs are sets. 



5 Experiments 

We present here experiments on standard MIL datasets as well as on the problem of mail classification, 
which motivated this work. To evaluate the proposed RBMs for sets, we compare them with the following 
baseline models: 

ClassRBM-poolIn: In this system, we simply apply the ClassRBM described in Section 2 on fixed-size
vectors that are generated by pooling all the vectors in each input set using the maximum, minimum
and average values of the vectors' features over that set.

ClassRBM-maxOut: This model is an implementation for a ClassRBM of the ideas in Zhou and Zhang
(2002); Sabato et al. (2010). The model is trained by gradient descent to predict the target based
only on the input vector that gives the maximal output response. We also apply the same strategy
with logistic regression (Logit-maxOut) and a one hidden layer perceptron (MLP-maxOut). Note
that these methods are only applicable in the case of binary classification, so we only use these
baselines for MIL problems².

SVM-miGraph: This state-of-the-art SVM model based on a kernel between sets gave some of the best
results on MIL tasks as reported by Deselaers and Ferrari (2010). The miGraph kernel (Zhou et al.,
2009) is:

$$K(X, X') = \frac{\sum_{s=1}^{|X|} \sum_{s'=1}^{|X'|} w_s(X)\, w_{s'}(X')\, k(\mathbf{x}^{(s)}, \mathbf{x}'^{(s')})}{\sum_{s=1}^{|X|} w_s(X) \sum_{s'=1}^{|X'|} w_{s'}(X')}$$

where the vectorial kernel k is the Gaussian kernel and where the weight assigned to a vector within
a set is inversely proportional to the number of "edges" that can be drawn with the other vectors
in the same set:

$$w_s(X) = \frac{1}{\sum_{s'=1}^{|X|} 1_{\|\mathbf{x}^{(s)} - \mathbf{x}^{(s')}\| < \sigma(X)}}$$

given an adaptive distance threshold $\sigma(X)$ defined as the average pairwise distance within the set.

SVM-miGraph2: This system is a variant of SVM-miGraph where the distance threshold to compute
the graphs is the same for all sets³: $\sigma_{mi2}(X) = \sigma_0$. This hyper-parameter is tuned by validation, as
are the C in the SVM loss and the $\gamma$ of the Gaussian kernel.

SVM-max: Like miGraph kernels, local kernels for sets (Wallraven et al., 2003) are computed from
Gaussian kernel values on all inter-set pairs of vectors. Instead of computing a weighted average,
only the kernel values that are maximal for each vector are summed.

² We tried some variants to generalize to multiclass, but the performance was always poor.
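As an illustration of the input-level pooling used by the ClassRBM-poolIn baseline, the following sketch converts a variable-size set into a fixed-size vector. The function name `pool_input_set` and the order in which the three statistics are concatenated are assumptions; the text only states that the maximum, minimum and average feature values are used.

```python
import numpy as np

def pool_input_set(X):
    """Map a set of S vectors (array of shape (S, D)) to one vector of size 3 * D.

    Concatenates the per-feature maximum, minimum and mean over the set
    (the concatenation order is an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    return np.concatenate([X.max(axis=0), X.min(axis=0), X.mean(axis=0)])

# Two sets of different sizes map to vectors of the same length.
set_a = [[1, 0, 1], [0, 0, 1]]
set_b = [[1, 1, 0], [0, 1, 0], [1, 1, 1]]
v_a = pool_input_set(set_a)
v_b = pool_input_set(set_b)
```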

In all our experiments, we perform k-fold cross validation, and at each fold the model hyper-parameters 
are optimized on a subset of training inputs (20%), not used to train the model. The reported results 
are obtained by averaging over all test fold examples. For all models except the SVMs, we train by 
stochastic gradient descent, and the hyper-parameters are the learning rate and the number of updates 
(early-stopping). The number of hidden neurons used was 100 (varying this number had little influence 
on the results). RBMs were either trained discriminatively only or using the hybrid objective. 

5.1 Experiments on MIL benchmark 

We start by evaluating our approach on the public and popular MIL datasets⁴ Musk1, Musk2 (drug
activity prediction task) and Elephant, Fox, Tiger (image annotation task). We carried out 5-times repeated
10-fold cross-validation. Table 1 shows the results of the several proposed variants of ClassSetRBM and
of the baseline models. For each dataset, the best performance is indicated in a gray cell, and results in
bold are the ones with no significant difference with this best reference, based on a 95% two-sided Student's
t-test on the classification accuracy differences. Overall, we see the ClassSetRBMs obtain good results
compared to the many baselines. In particular, the best performing variant, ClassSetRBM^XOR*, has the
highest average accuracy over all datasets and is never statistically worse than the best reference. Most
importantly, ClassSetRBM^XOR* clearly outperforms ClassRBM-poolIn and ClassRBM-maxOut, which
confirms the usefulness of having a pooling mechanism at the level of the hidden layer, as opposed to at
the input or output level. Hybrid training does not clearly improve over purely discriminative training.
This might be explained by the fact that the ClassSetRBMs modeled the input units (scaled in [0, 1]) as
binary variables. The use of a "hard" max pooling however does appear to be quite useful and almost
consistently improves on the softmax variant.

Model                       Musk1   Musk2   Elephant   Fox     Tiger   Average
ClassSetRBM^XOR             83.04   80.39   82.30      58.50   82.40   77.33
ClassSetRBM^XOR Hybrid      84.57   84.12   82.80      55.70   82.10   77.86
ClassSetRBM^XOR*            83.91   84.12   87.80      60.30   82.60   79.75
ClassSetRBM^XOR* Hybrid     83.70   81.18   86.40      58.00   83.20   78.50
ClassSetRBM^OR              82.61   83.73   82.70      58.60   80.90   77.71
ClassSetRBM^OR Hybrid       84.35   84.71   82.60      55.50   82.70   77.97
ClassSetRBM^OR*             85.65   80.39   87.10      59.70   82.60   79.09
ClassSetRBM^OR* Hybrid      85.87   81.76   85.50      56.60   82.60   78.47
ClassRBM-poolIn             81.52   81.37   82.70      59.80   76.80   76.44
ClassRBM-maxOut             83.91   80.98   81.60      57.60   75.50   75.92
MLP-maxOut                  85.65   78.82   82.00      55.40   74.40   75.25
Logit-maxOut                81.09   80.00   81.90      58.80   75.90   75.54
SVM-miGraph                 [...]   82.35   83.80      [...]   81.20   79.12
SVM-miGraph2                85.43   82.35   83.80      61.30   80.80   78.74
SVM-max                     83.48   84.51   84.60      59.70   81.70   78.80

Table 1: Classification accuracies (%) on MIL datasets

³ Personal communication with the authors of (Zhou et al., 2009).

⁴ http://www.cs.columbia.edu/~andrews/mil/datasets.html

5.2 Experiments on mail categorization 

The task which motivated this work is image document categorization, where the documents are pieces
of mail that can be considered as sets of pages. These pages can be printed or handwritten letters, official
papers, forms, envelopes or white pages. The main application is mailroom automation, which is of great
interest for large organizations where the volume of incoming mail can reach tens of thousands of pieces per day
and must be processed within a couple of days. Routing of documents can then be done automatically
with a classifier embedded in the document management system.

Each image of a page is processed by an OCR for printed and handwritten text, which produces binary 
input features that correspond to the presence/absence of a given word in the page. The vocabulary size 
is limited to the 10000 most frequently recognized words. Other features from image analysis are also 
appended: 

• Sub-resolution (48 features): These are average gray-scaled pixel values on a 6 x 8 regular grid. 

• Document Layout Analysis (17 features): The document image is segmented into zones corresponding
to boxes, lines, printed and handwritten text. We then compute some geometric statistics on
each type of zone.

• Predefined page class detectors (18 features): Each detector was designed to detect a common type 
of pages, such as bank checks, cursive/printed letters and different kinds of official papers. The 
output recognition score is used as a feature. 

Model                DsUs    DsDe
ClassSetRBM^XOR      87.71   83.75
ClassSetRBM^XOR*     88.55   84.18
ClassSetRBM^OR       88.13   83.86
ClassSetRBM^OR*      85.69   78.43
ClassRBM-poolIn      87.97   82.54
SVM-miGraph          86.22   83.74
SVM-miGraph2         86.82   84.16
SVM-max              56.45   64.14

Table 2: Classification accuracies (%) on mail classification task



The resulting feature vectors are high-dimensional, sparse, and noisy to some extent (the word error
rate of OCR typically lies between 5% and 50%). The application of mail classification typically does not
fit the assumptions made in MIL well. In particular, we expect each page to provide only partial clues for
predicting the set label, such that considering label assignments at the page level is not natural. In other 
words, while there might be enough informative pages to confidently identify a set's label, this does not 
imply that this label is appropriate for any one of these pages individually. Moreover, this problem is a 
multiclass classification problem, while most MIL algorithms are developed for binary classification and 
do not always generalize clearly to the multiclass setting, because of the asymmetric definition (at least 
one positive label vs. all negative labels) of MIL. 

We carried out experiments on two collected datasets for mail classification: DsDe (14 593 pieces of
mail, 102 071 pages, 11 classes) and DsUs (16 808 pieces of mail, 160 372 pages, 6 classes). Table 2
shows the 5-fold cross-validation average accuracy of different models. It is also important in mailroom
automation to be able to reject pieces of mail for which the prediction is too uncertain: the pieces of mail
rejected by the system can then be processed by a human agent, in order to limit classification errors of
the whole process. The rejection is done by comparing the classifier's output confidence to a threshold,
thus this confidence estimate has to be reliable. A standard way to evaluate the quality of the output
confidence scores in multiclass classification is to plot micro-averaged recall and precision (Sebastiani,
2002) for different values of the rejection threshold. This is shown in Figure 4.
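The rejection mechanism can be illustrated with a small sketch. It assumes that the confidence of a prediction is the largest class posterior, that a piece of mail is rejected when this confidence falls below the threshold, that precision is the fraction of accepted items that are correctly classified and that recall is the fraction of all items that are accepted and correctly classified. These definitions, the function name and the convention of returning a precision of 1.0 when everything is rejected are assumptions of this example, not details taken from the experiments.

```python
import numpy as np

def precision_recall_with_rejection(probs, labels, threshold):
    """Precision and recall of a classifier that rejects low-confidence inputs.

    probs: (N, C) class posteriors, labels: (N,) true class indices.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    accepted = conf >= threshold               # items the system decides on
    correct = accepted & (pred == labels)      # accepted and correctly classified
    precision = correct.sum() / accepted.sum() if accepted.any() else 1.0
    recall = correct.sum() / len(labels)
    return precision, recall

# Hypothetical posteriors for four pieces of mail and two classes.
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
labels = [0, 1, 1, 0]
strict = precision_recall_with_rejection(probs, labels, threshold=0.7)
lenient = precision_recall_with_rejection(probs, labels, threshold=0.5)
```

Sweeping the threshold over a range of values and plotting the resulting pairs would give one precision/recall curve per model.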



Again, we observe that ClassSetRBM^XOR* achieves the best performance overall. We emphasize that,
for both of these large datasets, ClassSetRBM^XOR* was much faster to train than the SVM approaches.



6 Conclusion 

We described how the classification restricted Boltzmann machine could be adapted to problems where 
the inputs correspond to sets of vectors. Different generalizations for this problem were investigated, with 
one of these variants achieving consistent, competitive performance on multiple-instance learning datasets.
It was also applied with success to a mail categorization task. Our experiments confirm the usefulness of 
pooling at the hidden representation level, as opposed to the input or output level. Directions for future 
work include applying this framework in deep neural architectures. 

Figure 4: Precision/recall curves for mail classification problem. (a) Precision/Recall curves on DsUs;
(b) Precision/Recall curves on DsDe. Horizontal axis: micro-averaged recall (%).

Acknowledgements 

Hugo acknowledges the financial support of NSERC. 



Appendix 

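Derivation of the free-energy for ClassSetRBM^XOR

We provide here the derivation of the free-energy for ClassSetRBM^XOR. To simplify the derivation, we
assume hidden layer sizes of H = 1. The generalization to arbitrary size is trivial, since the necessary sums
factorize for each hidden unit, for the same reason that the conditional over H given X and y factorizes
into each of the j-th hidden unit sets $\{h_j^{(s)}\}$.

$$\begin{aligned}
p(\mathbf{y} = \mathbf{e}_c | X) &= \sum_{H} p(\mathbf{y} = \mathbf{e}_c, H | X) \\
&= \frac{\sum_{H} \exp(-E(X, \mathbf{y}, H))}{\sum_{H'} \sum_{c'=1,\dots,C} \exp(-E(X, \mathbf{e}_{c'}, H'))} \qquad (2)
\end{aligned}$$

where

$$\begin{aligned}
\sum_{H} \exp(-E(X, \mathbf{y}, H))
&= \sum_{H} \exp\Big( \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} + \sum_s \mathbf{c}^\top \mathbf{h}^{(s)} + \sum_s \big( \mathbf{h}^{(s)\top} \mathbf{W} \mathbf{x}^{(s)} + \mathbf{h}^{(s)\top} \mathbf{U} \mathbf{y} \big) \Big) \\
&= \exp\Big( \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} \Big) \sum_{H} \exp\Big( \sum_s \mathbf{c}^\top \mathbf{h}^{(s)} + \sum_s \big( \mathbf{h}^{(s)\top} \mathbf{W} \mathbf{x}^{(s)} + \mathbf{h}^{(s)\top} \mathbf{U} \mathbf{y} \big) \Big) \\
&= \exp\Big( \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} \Big) \Big( \exp(0) + \sum_s \exp(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s)} + \mathbf{U}_{1\cdot} \mathbf{y}) \Big) \qquad (3) \\
&= \exp\Big( \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} \Big) \Big( 1 + \exp(\mathbf{U}_{1\cdot} \mathbf{y}) \sum_s \exp(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s)}) \Big)
\end{aligned}$$

where at line 3 we used the mutual exclusivity constraint $\sum_s h_1^{(s)} \in \{0, 1\}$ over H. Hence we can write

$$\begin{aligned}
\log \sum_{H} \exp(-E(X, \mathbf{y}, H))
&= \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} + \mathrm{softplus}\Big( \mathbf{U}_{1\cdot} \mathbf{y} + \log \sum_s \exp(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s)}) \Big) \\
&= \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} + \mathrm{softplus}(\mathbf{U}_{1\cdot} \mathbf{y} + \mathrm{softmax}_1(X))
\end{aligned}$$

where we use the definition of $\mathrm{softmax}_j(X)$ for the ClassSetRBM^XOR (see Section 3.1). Going back to
Equation 2:

$$\begin{aligned}
p(\mathbf{y} = \mathbf{e}_c | X)
&= \frac{\sum_{H} \exp(-E(X, \mathbf{y}, H))}{\sum_{H'} \sum_{c'=1,\dots,C} \exp(-E(X, \mathbf{e}_{c'}, H'))} \\
&= \frac{\exp\big( \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} + \mathrm{softplus}(\mathbf{U}_{1\cdot} \mathbf{y} + \mathrm{softmax}_1(X)) \big)}{\sum_{c'=1,\dots,C} \exp\big( \mathbf{d}^\top \mathbf{e}_{c'} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} + \mathrm{softplus}(\mathbf{U}_{1\cdot} \mathbf{e}_{c'} + \mathrm{softmax}_1(X)) \big)} \\
&= \frac{\exp\big( \mathbf{d}^\top \mathbf{y} + \mathrm{softplus}(\mathbf{U}_{1\cdot} \mathbf{y} + \mathrm{softmax}_1(X)) \big)}{\sum_{c'=1,\dots,C} \exp\big( \mathbf{d}^\top \mathbf{e}_{c'} + \mathrm{softplus}(\mathbf{U}_{1\cdot} \mathbf{e}_{c'} + \mathrm{softmax}_1(X)) \big)} \\
&= \frac{\exp(-F^{\mathrm{XOR}}(X, \mathbf{y}))}{\sum_{c'=1,\dots,C} \exp(-F^{\mathrm{XOR}}(X, \mathbf{e}_{c'}))}
\end{aligned}$$

where we recover $F^{\mathrm{XOR}}(X, \mathbf{y}) = -\mathbf{d}^\top \mathbf{y} - \mathrm{softplus}(\mathbf{U}_{1\cdot} \mathbf{y} + \mathrm{softmax}_1(X))$ for H = 1. Because of the
hidden unit factorization property, we get the general free-energy function $F^{\mathrm{XOR}}(X, \mathbf{y}) = -\mathbf{d}^\top \mathbf{y} -
\sum_{j=1}^{H} \mathrm{softplus}(\mathbf{U}_{j\cdot} \mathbf{y} + \mathrm{softmax}_j(X))$ for arbitrary values of H.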
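The result of this derivation can be checked numerically for H = 1 by enumerating the valid hidden configurations directly. The sketch below uses hypothetical random parameters; `w` and `u` stand for the single rows $\mathbf{W}_{1\cdot}$ and $\mathbf{U}_{1\cdot}$, and the $\mathbf{b}^\top \mathbf{x}^{(s)}$ terms are left out because they cancel in the posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D, C = 4, 5, 3                        # set size, input size, number of classes
X = rng.integers(0, 2, size=(S, D)).astype(float)
w, u = rng.normal(size=D), rng.normal(size=C)    # W_1. and U_1. (H = 1)
c1, d = rng.normal(), rng.normal(size=C)

pre = c1 + X @ w                         # c_1 + W_1. x^(s) for each s

# Brute force: the valid configurations are "no copy active" or "exactly one copy active".
configs = [np.zeros(S)] + [np.eye(S)[s] for s in range(S)]
brute = np.array([sum(np.exp(d[k] + h @ (pre + u[k])) for h in configs)
                  for k in range(C)])
brute /= brute.sum()

# Closed form: -F(X, e_k) = d_k + softplus(U_1. e_k + softmax_1(X)).
pooled = np.logaddexp.reduce(pre)
neg_F = d + np.logaddexp(0.0, u + pooled)
closed = np.exp(neg_F) / np.exp(neg_F).sum()
```

Both `brute` and `closed` should agree up to floating-point error.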
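Derivation of the free-energy for ClassSetRBM^OR

Again, we provide the derivation of the free-energy for ClassSetRBM^OR. Here, we can also simplify the
derivation by assuming hidden layers of size H = 1.

$$\begin{aligned}
p(\mathbf{y} = \mathbf{e}_c | X) &= \sum_{G} \sum_{H} p(\mathbf{y} = \mathbf{e}_c, H, G | X) \\
&= \frac{\sum_{G} \sum_{H} \exp(-E(X, \mathbf{y}, H, G))}{\sum_{G'} \sum_{H'} \sum_{c'=1,\dots,C} \exp(-E(X, \mathbf{e}_{c'}, H', G'))} \qquad (4)
\end{aligned}$$

where

$$\begin{aligned}
\sum_{G} \sum_{H} \exp(-E(X, \mathbf{y}, H, G))
&= \sum_{G} \sum_{H} \exp\Big( \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} + \sum_s \mathbf{c}^\top \mathbf{h}^{(s)} + \sum_s \big( \mathbf{h}^{(s)\top} \mathbf{W} \mathbf{x}^{(s)} + \mathbf{g}^{(s)\top} \mathbf{U} \mathbf{y} \big) \Big) \\
&= \exp\Big( \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} \Big) \sum_{G} \sum_{H} \exp\Big( \sum_s \mathbf{c}^\top \mathbf{h}^{(s)} + \sum_s \big( \mathbf{h}^{(s)\top} \mathbf{W} \mathbf{x}^{(s)} + \mathbf{g}^{(s)\top} \mathbf{U} \mathbf{y} \big) \Big) \\
&= \exp\Big( \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} \Big) \Bigg( \exp(0) \sum_{H} \exp\Big( \sum_{s'} \mathbf{c}^\top \mathbf{h}^{(s')} + \mathbf{h}^{(s')\top} \mathbf{W} \mathbf{x}^{(s')} \Big) \\
&\qquad + \sum_s \exp(\mathbf{U}_{1\cdot} \mathbf{y}) \sum_{H \text{ s.t. } h_1^{(s)} = 1} \exp\Big( \sum_{s'} \mathbf{c}^\top \mathbf{h}^{(s')} + \mathbf{h}^{(s')\top} \mathbf{W} \mathbf{x}^{(s')} \Big) \Bigg) \qquad (5) \\
&= \exp\Big( \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} \Big) \Bigg( \exp(0) \prod_{s'} \big( 1 + \exp(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s')}) \big) \\
&\qquad + \sum_s \exp(\mathbf{U}_{1\cdot} \mathbf{y} + c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s)}) \prod_{s' \neq s} \big( 1 + \exp(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s')}) \big) \Bigg) \qquad (6)
\end{aligned}$$

where at line 5 we used the mutual exclusivity constraint $\sum_s g_1^{(s)} \in \{0, 1\}$ over G, and at line 6 we used
the inequality constraint between H and G, $h_1^{(s)} \geq g_1^{(s)} \;\; \forall s$. Hence we can write

$$\begin{aligned}
\log \sum_{G} \sum_{H} \exp(-E(X, \mathbf{y}, H, G))
&= \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} + \sum_{s'} \mathrm{softplus}(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s')}) + \mathrm{softplus}\Big( \mathbf{U}_{1\cdot} \mathbf{y} + \log \sum_s \frac{\exp(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s)})}{1 + \exp(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s)})} \Big) \\
&= \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} + \sum_{s'} \mathrm{softplus}(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s')}) + \mathrm{softplus}\Big( \mathbf{U}_{1\cdot} \mathbf{y} + \log \sum_s \exp(\mathrm{softminus}(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s)})) \Big) \\
&= \mathbf{d}^\top \mathbf{y} + \sum_s \mathbf{b}^\top \mathbf{x}^{(s)} + \sum_{s'} \mathrm{softplus}(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s')}) + \mathrm{softplus}(\mathbf{U}_{1\cdot} \mathbf{y} + \mathrm{softmax}_1(X))
\end{aligned}$$

where we use the definition of $\mathrm{softmax}_j(X)$ for the ClassSetRBM^OR (see Section 3.2). Going back to
Equation 4:

$$\begin{aligned}
p(\mathbf{y} = \mathbf{e}_c | X)
&= \frac{\sum_{H} \sum_{G} \exp(-E(X, \mathbf{y}, H, G))}{\sum_{H'} \sum_{G'} \sum_{c'=1,\dots,C} \exp(-E(X, \mathbf{e}_{c'}, H', G'))} \\
&= \frac{\exp\big( \mathbf{d}^\top \mathbf{y} + \mathrm{softplus}(\mathbf{U}_{1\cdot} \mathbf{y} + \mathrm{softmax}_1(X)) \big)}{\sum_{c'=1,\dots,C} \exp\big( \mathbf{d}^\top \mathbf{e}_{c'} + \mathrm{softplus}(\mathbf{U}_{1\cdot} \mathbf{e}_{c'} + \mathrm{softmax}_1(X)) \big)} \\
&= \frac{\exp(-F^{\mathrm{OR}}(X, \mathbf{y}))}{\sum_{c'=1,\dots,C} \exp(-F^{\mathrm{OR}}(X, \mathbf{e}_{c'}))}
\end{aligned}$$

where we recover $F^{\mathrm{OR}}(X, \mathbf{y}) = -\mathbf{d}^\top \mathbf{y} - \mathrm{softplus}(\mathbf{U}_{1\cdot} \mathbf{y} + \mathrm{softmax}_1(X))$ for H = 1. Again, because
of the hidden unit factorization property of ClassSetRBM^OR, we get the general free-energy function
$F^{\mathrm{OR}}(X, \mathbf{y}) = -\mathbf{d}^\top \mathbf{y} - \sum_{j=1}^{H} \mathrm{softplus}(\mathbf{U}_{j\cdot} \mathbf{y} + \mathrm{softmax}_j(X))$ for arbitrary values of H.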
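As with the XOR case, this closed form can be verified for H = 1 by brute-force enumeration of all valid (G, H) configurations, i.e. those satisfying the mutual exclusivity constraint on G and the inequality constraint between H and G. The parameters below are hypothetical random values; `w` and `u` stand for $\mathbf{W}_{1\cdot}$ and $\mathbf{U}_{1\cdot}$, and terms that do not depend on y are dropped since they cancel in the posterior.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
S, D, C = 4, 5, 3                        # set size, input size, number of classes
X = rng.integers(0, 2, size=(S, D)).astype(float)
w, u = rng.normal(size=D), rng.normal(size=C)    # W_1. and U_1. (H = 1)
c1, d = rng.normal(), rng.normal(size=C)

pre = c1 + X @ w                         # c_1 + W_1. x^(s) for each s

# Brute force over G (at most one active copy) and H (any binary vector with h >= g).
g_configs = [np.zeros(S)] + [np.eye(S)[s] for s in range(S)]
brute = np.zeros(C)
for g in g_configs:
    for bits in itertools.product([0.0, 1.0], repeat=S):
        h = np.array(bits)
        if np.all(h >= g):
            # exp(-E) up to factors independent of y; entry k uses y = e_k.
            brute += np.exp(d + h @ pre + g.sum() * u)
brute /= brute.sum()

# Closed form with the OR pooling: softmax_1(X) = log sum_s exp(softminus(pre_s)).
softminus = -np.logaddexp(0.0, -pre)
pooled = np.logaddexp.reduce(softminus)
neg_F = d + np.logaddexp(0.0, u + pooled)
closed = np.exp(neg_F) / np.exp(neg_F).sum()
```

The term $\sum_{s'} \mathrm{softplus}(c_1 + \mathbf{W}_{1\cdot} \mathbf{x}^{(s')})$ is present in the brute-force sum for every class and therefore disappears after normalization, which is why it is absent from the free-energy.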